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Abstract 

The triplet distance is a distance measure that compares two rooted trees on the same set of leaves by 
enumerating all sub-sets of three leaves and counting how often the induced topologies of the tree are equal or 
different. We present an algorithm that computes the triplet distance between two rooted binary trees in time 
0 (n log 2 n). The algorithm is related to an algorithm for computing the quartet distance between two unrooted 
binary trees in time 0 (n log n). While the quartet distance algorithm has a very severe overhead in the asymptotic 
time complexity that makes it impractical compared to 0 (n 2 ) time algorithms, we show through experiments that 
the triplet distance algorithm can be implemented to give a competitive wall-time running time. 



Background 

Using trees to represent relationships is widespread in 
many scientific fields, in particular in biology where trees 
are used e.g. to represent species relationships, so called 
phylogenies, the relationship between genes in gene 
families or for hierarchical clustering of high-throughput 
experimental data. Common for these applications is that 
differences in the data used for constructing the trees, or 
differences in the computational approach for constructing 
the trees, can lead to slightly different trees on the same 
set of leaf IDs. 

To compare such trees, distance measures are often 
used. Common distance measures include the Robinson- 
Foulds distance [1], the triplet distance [2], and the quartet 
distance [3] . Common for these three distance measures is 
that they all enumerate certain features of the trees they 
compare and count how often the features differ between 
the two trees. The Robinson-Foulds distance enumerates 
all edges in the trees and tests if the bipartition they 
induce is found in both trees. The triplet distance (for 
rooted trees) and quartet distance (for unrooted trees) 
enumerate all subsets of leaves of size three and four, 
respectively, and test if the induced topology of the leaves 
is the same in the two trees. 
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Efficient algorithms to compute these three distance 
measures exist. The Robinson-Foulds distance can be 
computed in time O («) [4] for trees with n leaves, which 
is optimal. The quartet distance can be computed in time 
O (« log n) for binary trees [5], in time O {d 9 n log n) for 
trees where all nodes have degree less than d [6], and in 
sub-cubic time for general trees [7]. See also Christiansen 
et al. [8] for a number of algorithms for general trees with 
different tradeoffs depending on the degree of inner 
nodes. For the triplet distance, O [n ) time algorithms 
exist for both binary and general trees [2,9]. 

Brodal et al. [5] present two algorithms for computing 
the quartet distance for binary trees; one running in time 
O (h log n), and one running in time O (n log n). The lat- 
ter is the most practical, and was implemented in [10,11], 
where it was shown to be slower in practice compared to 
a simple O [n ) time algorithm [12] unless n is above 
2000. In this paper we focus on the triplet distance and 
develop an O (n log 2 n) time algorithm for computing this 
distance between two rooted binary trees. The algorithm 
is related to the O (« log 2 n) time algorithm for quartet 
distance, but its core accounting system is completely 
changed. As we demonstrate by experiments, the resulting 
algorithm is not just theoretically efficient but also Effi- 
cient in practice, as it is faster than a simple O (n ) time 
algorithm based on [12] already for n larger than 12, and 
is e.g. faster by a factor of 50 when n is 2900. 
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Methods 

The triplet distance measure between two rooted trees 
with the same set of leaf IDs is based on the topologies 
induced by a tree when selecting three leafs of the tree. 
Whenever three leaves, a, b and c, are selected, a tree can 
induce one of four topologies: It can either group a and b, 
a and c, or b and c, or it can put them at equal distance to 
the root, i.e. all three pairs of leaves have the same lowest 
common ancestor (see Figure 1). For binary trees, the last 
case is not possible since this would require a node with at 
least three children. 

The triplet distance is the number of triplets whose 
topology differ in the two trees. It can naively be com- 
puted by enumerating all O (n 3 ) sets of three leafs and 
comparing the induced topologies in the two trees, count- 
ing how often the trees agree or disagree on the topology. 
Triplet topologies in a tree, however, are not independent, 
and faster algorithms can be constructed exploiting this, 
comparing sets of triplet topologies faster. Critchlow et al. 
[2] for example exploit information about the depth of 
shared ancestors of leaves in a tree to achieve an O (n ) 
time algorithm for binary trees while Bansal et al. [9] con- 
struct a table of shared leaf-sets and achieve an O (k 2 ) 
time algorithm for general trees. 

For the quartet distance, the analogue to the triplet dis- 
tance for unrooted trees, Brodal et al. [5] construct an 
even faster algorithm by identifying sets of four different 
leaves in one tree through coloring all leaves with one of 
three different colors and then counting the number of 
topologies compatible with the coloring in the other tree. 
Using a variant of the so-called "smaller half trick" for 
keeping the number of different relevant colorings low, 
the algorithm manages to construct all relevant colorings 
with O (n log n) color changes. The number of topologies 
compatible with the coloring can then be counted in the 
other tree using a data structure called a "hierarchical 
decomposition tree". Maintaining the hierarchical decom- 
position tree, however, involves a number of polynomial 
manipulations that, while theoretically can be done in con- 
stant time per polynomial, are quite time consuming in 
practice [10,11], making the algorithm slow in practice. 

A naive algorithm that computes the quartet distance 
between two unrooted trees by explicitly inspecting each 
of the O (w 4 ) quartets can be modified to compute the 
triplet distance between two rooted trees without loss of 



time by adding a new leaf x above the two root nodes 
and limit the inspection of quartets to the quartets con- 
taining this new leaf. However, the efficient algorithms 
for computing the quartet distance presented in [5,7] do 
not explicitly inspect every quartet and therefore cannot 
be modified to compute the triplet distance following 
this simple approach. 

In the following, we develop an efficient algorithm for 
computing the triplet distance between two rooted bin- 
ary trees T x and T 2 with the same set of leaf IDs. Our 
key contribution is to show how all triplets in one tree, 
say T lt can be captured by coloring the leaves with col- 
ors, and how the smaller half trick lets us enumerate all 
such colorings in time O (n log n). We will then con- 
struct a hierarchical decomposition tree (HDT) for T 2 
that counts its number of compatible triplets. Unlike the 
algorithms for computing the quartet distance [5], 
where the counting involves manipulations of polyno- 
mials, the HDT for triplets involves simple arithmetic 
computations that are efficient in practice. 

Counting shared triplets through leaf colorings 

A triplet is a set {a, b, c] of three leaf IDs. For a tree T, 

we assign each of the ( " ^ triplets to the lowest com- 
mon ancestor in T of the three leaves containing a, b, 
and c. For a node v e T we denote by r„ the set of tri- 
plets assigned to v. Then {r v | v e T} is a partition of 
the set 7" of triplets. Thus, {r v n t u \ v T lt u e T 2 } is 
also a partition of f. Our algorithm will find 

Shared(T) = X! X! Shared(r„ n r„), 

i/eTi ueT 2 

where Shared(S) on a set S of triplets is its number of 
triplets having the same topology in T l and T 2 . The tri- 
plet distance of 7\ and T 2 is then f — Shared (T) • 

In the algorithm, we capture the triplets r v by a color- 
ing of the leaves: if all leaves not in the subtree of v 
have no color, all leaves in one subtree of v are "red", 
and all leaves in the other subtree are "blue", then x v is 
exactly the triplets having two leaves colored "red" and 
one leaf colored "blue", or two leaves colored "blue" and 
one leaf colored "red". See Figure 2. 
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For such a coloring according to a node v e T x , and 
for a node u e T 2 , the number Shared(r v n x u ) could be 
found as follows: let x and y be the two subtrees of u, 
let x(r) and y(r) be the number of leaves colored red in 
the subtrees x and y, respectively, and x(b) and y{b) the 
number of leaves colored blue. The number Shared(r„ n 

r M ) is then (f) ■ yW + ■ *M + (*>) ■ *m ♦ (*«) • *». 

We call these the triples of u compatible with the 
coloring. 

Explicitly going through 7\ and coloring for each node v 
would take time O (n) per node, for a total time of O (n 2 ). 
We reduce this to O (n log n) by using the smaller half 
trick. Going through T 2 for each coloring and counting 
the number of compatible triplets would also take time O 
(n ). Using a HDT we find this count in 0(1) time, while 
updating the structure takes time O (log n) after each leaf 
color change. The result is 0(n log n) total running time. 
In essence, the HDT performs the inner sum of 

Shared (T) = Y" t Shared ( t„ n t„), while the 

coloring algorithm performs the outer sum. 

Smaller half trick 

We go through nodes v in a depth first order while 
maintaining two invariants of the algorithm: 1) Before 
we process v, the entire subtree of v is colored "red" 
and the rest of the tree has no color; 2) When we return 
from the depth first recursion, the entire tree has no 
color. 

For v let S(v) denote the smallest subtree of v and let 
L{v) denote the largest subtree of v. We go through the 
coloring as follows (see Figure 3): 

1. Color S(v) "blue". Now v has the coloring that 
enable us to count the triplets for v. 

2. Remove the color for S(v). Now we can call recur- 
sively on L(v) while satisfying the invariant. 

3. Returning from the recursive call, the entire tree 
is colorless by invariant 2. 

4. Color S(v) "red". Now we satisfy invariant 1 for 
calling recursively on S{v). 



5. Call recursively on S{v). When we return we 
satisfy invariant 2 for returning from the recursive 
call. 

Using this recursive algorithm, we go through all col- 
orings of the tree. In each instance (not counting recur- 
sive calls), we only color leaves in S(v), and only a 
constant number of times. Thus, a leaf i is only colored 
when visiting an ancestor node v where £ S(v), i.e. I is 
in the smaller subtree of v. Since I can have at most O 
(log n) such ancestors, each leaf will only be colored at 
most O (log n) times, implying a total of O (« log ri) 
color changes. 

Hierarchical decomposition tree 

We build a data structure, the hierarchical decomposi- 
tion tree (HDT), on top of the second tree T 2 in order 
to count the triplets in T 2 compatible with the coloring 
of leaves in the first tree 7\. The HDT is a balanced 
binary tree where each node corresponds to a connected 
part of T 2 . Each node in the HDT, or component, keeps 
a count of the number of compatible triplets the corre- 
spondent part of T 2 contains, plus some additional 
book-keeping that makes it possible to compute this 
count in each component in constant time using the 
information stored in the component's children in the 
HDT. 

The HDT contains three different kinds of compo- 
nents: 

• L: A leaf in T 2 , 

• I: An inner node in T 2 , 

• C: A connected sub-part of T 2 , 

where for type C we require that at most two edges in 
T 2 crosses the boundary of the component; at most one 
going up towards the root and at most one going down 
to a subtree. 

The leaves and inner nodes of T 2 are transformed into 
L and I components, respectively, and constitute the 
leaves of the HDT. C components are then formed by 
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pairwise joining other components along an edge in T 2 
by one of two compositions, see Figure 4. C components 
can be thought of as consisting of a path from a sub- 
tree below the C component going up towards the root 
of T 2 , such that all trees branching o to other children 
along the path are all contained in the component. In 
the following we show how the HDT of T 2 can be con- 
structed in time 0(n), and we prove that the height of 
the HDT is O (log n). 

The construction algorithm operates on components 
and edges. Each component is of one of the types L, I, 
or C. It has a parent pointer, pointing to its parent com- 
ponent in the HDT, and a bit, down_closed, indicating if 
the component corresponds to a complete subtree in 
the original tree. When a component does not yet have 
a parent in the HDT, we make the parent pointer point 
to the component itself. We use this pointer to test if 
an edge has been contracted in the HDT construction. 
Edges consist of two pointers, an up pointer and a down 
pointer that points to the components above and below 
the edge in T 2 . 

In a single traversal of the tree, the algorithm initially 
builds a component for each node in the tree (an L 
component for each leaf and an I component for each 
inner node) and an edge for each edge in the tree. The 
parent pointer of each component is initially set to 
point to the component itself, and the down_closed bit 
is set to true for L components and false for I compo- 
nents. The edges are put in a list es. We then perform a 
series of iterations, each constructing one level of the 
HDT. In each iteration, edges whose neighbors can be 
joined to form a C component via the constructions in 
Figure 5 are greedily removed from es, and another list, 
next, is used to keep the edges that cannot be con- 
tracted in this iteration. The details of an iteration is 
shown in Figure 6. The process stops when es becomes 
empty. 

In case 1, one of e's neighbors already has a parent, 
thus this neighbor has already been contracted into a C 
component in this iteration and should not be con- 
tracted again. Case 2 is the situation in Figure 5a, and 
case 3 is the situation in Figure 5b. In case 4, e.down is 
an I component or e.up is an I component and e.down 
does not correspond to a complete subtree, hence none 
of the constructions in Figure 5 apply. After removing 
all edges from es, the up and down pointers of the 



remaining edges in next are updated in lines 21-22, such 
that they point to either a newly created component or 
still point to the same component. The edges in next 
are finally moved back to es in line 23, and a new itera- 
tion is started. The algorithm finishes when es is empty, 
and the root of the HDT is the C component resulting 
from joining the ends of the last edge. We now argue 
that height of the HDT is 0(log «), and that the con- 
struction time is 0(«). Each iteration of the code in 
Figure 6 takes time linear in the number E = \es\ of 
edges remaining to be contracted at the beginning of 
the iteration. The height and time bounds follow if we 
can argue that the number of edges decreases geometri- 
cally for each iteration. In the following we argue that the 
number of edges after one iteration is at most 11/12. E. 

We first argue that the number of contractible edges at 
the beginning of the iteration is at least El A. Note that 
only edges incident to I components might not be con- 
tractible, and that the number of down-closed compo- 
nents is at least one larger than the number of I 
components. If the number of I components is at most 
E/4, then at most 3 • E/4 incident edges might not be 
contractible, i.e. at least E/4 edges are contractible. 
Otherwise the number of I components is more than 
E/4, and therefore the number of down-closed compo- 
nents is more than E/4 + 1. Since the parent edges from 
all down-closed components are contractible (for E > 1), 
the number of contactable edges is again at least E/4. 

Since each contracted edge can prevent at most two 
other edges incident to the two merged components 
(see Figure 5) from being contracted in the same itera- 
tion, each iteration will contract at least 1/3 of the at 
least E/4 contractible edges. It follows that an iteration 
reduces the number of contractible edges by at least 
E/12. 

Counting triplets in the hierarchical decomposition tree 

In each component we keep track of N, the number of 
triplets contained within the component (i.e., where the 
three leaves of the triplet are within the component) that 
are compatible with the coloring. When we change the 
coloring, we update the HDT to reflect this, so we can 
always read o the total number of compatible triplets 
from the root of the HDT in constant time. 

By adding a little book-keeping information to each 
component we make it possible to compute N (and the 
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(L) 



(I) 



(C) 



Figure 4 Component types in the HDT. The three different types of components. L and I components contain a single node from the 
underlying tree while C components contain a connected set of nodes. 



book-keeping information itself) for a component in 
constant time from the information in the component's 
children in the HDT. This has two consequences: it 
makes it possible to add the book-keeping and N to all 
components after the construction in linear time, and it 
makes it possible to update the information when a leaf 
changes color by updating only components on the path 
from the leaf-component to the root in the HDT, a path 
that is bounded in length by O (log n). Since we only 
change the color of a leaf O (n log n) times in the triplet 
distance algorithm, it is this property of the HDT that 
keeps the total running time at O (n log 2 «). For the 
book-keeping we store six numbers in each component 
in addition to N: 

♦ R: The number of leaves colored red in the 
component. 

♦ B: The number of leaves colored blue in the 
component. 

♦ fx '• The number of pairs of red leaves found in the 
same sub-tree of a C component. Let T b i = 1,..., n, 
denote the sub-trees in the component, see Figure 
7a, and let r(i) denote the number of red leaves in 

treeT, Then n = ^ . 

♦ yb : The number of pairs of blue leaves found in 
the same sub-tree of a C component. 

♦ rb '■ The number of pairs of leaves with a red leaf 
in one sub-tree and a blue in another sub-tree, 
where the red leaf is in a tree further down on the 
path in a C component. Let T t , i = 1,..., n, denote 
the sub-trees in the component, see Figure 7, and let 
r(i) denote the number of red leaves in tree T, and b 



(i) the number of blue leaves in T t . Then 

♦ b r : The number of pairs of leaves with a red leaf in 
one sub-tree and a blue in another sub-tree, where 
the blue leaf is in a tree further down on the path in 
a C component. 

We describe how the book-keeping variables and N 
are computed through a case-analysis on how the com- 
ponents are constructed. L and I components are con- 
structed in only one way, while C components are 
constructed in one of two ways (see Figure 5). 

L components: For a leaf component, R is 1 if the leaf 
is colored red and 0 otherwise, B is 1 if the leaf is 
colored blue and 0 otherwise, and all other counts are 0. 

I components: All counts are 0. 

C components, case Figure 5a: Let x be one of the 

counters R, B, ff,- listed above for a C component, and 
let x{l) and x(2) denote the corresponding counter in 
component C\ and C2, respectively, with C2 above C\ in 
the underlying tree. Then 



R = R(l) +R(2) 
rr = rr{\)+rr{2) 
fb = rf>(2) + fb(l)+R(l)-B(2) 



B = B(l) + B(2) 

bb = bb(\) + bb{2) 

br = br(2) + br{l) + B(\) ■ R(2). 



The triplet count is then computed as 

N = N(1) + N(2) + ( R( 2 1) ) • B(2) + ( B( 2 1) ). R(2) 

+ n{2) ■ B(l) + bfc(2) • R(l) + R(l) • ?b{2)+B{\) ■ br{2). 

C components, case Figure 5b: Let #(1) denote one 
of the counters listed above for component C\. Then 
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(a) 



(b) 



Figure 5 Component compostions in the construction of the HDT. The two different ways of constructing a C component by merging two 
underlying components. The topmost of the components can either be a C component (a) or an I component (b) while the bottommost 
component, C,, must be a C or L component. If the topmost component is an I component, the bottommost must be downwards closed, i.e. it 
cannot have a downwards edge crossing its boundary. 



R 



R(l), B = B(l), n 



2 



bb 



(bid). 



r \f _ \j X _ q. Since the inner node in the composition 
does not contain any leaves, the triplet count is simply 
N = N (1). 



Results and discussion 

We implemented the algorithm in C++ and a simple O 
(n 2 ) time algorithm to ensure that it computes the cor- 
rect triplet distance. 



for 


each edge e in es 






if e. up. parent ^ e . up or e . down . parent 


^ e . down 




/* Case 1 : At least one of the end- 


-points of e is already 




contracted */ 






move e from es to next 






else if e. up. type = C and (e. down. type 


— C or e. down. type = L) 




/* Case 2: Fig . 5(a) . */ 






remove e from es 






make a new component c 






c . down.closed = e . down.closed 






e. up. parent = e . down . parent = c 






else if c. up. type = I and e . down . down. 


closed = True 




/* Case 3: Fig . 5(b) . */ 






remove e from es 






make a new component c 






c . down.closed = False 






e. up. parent = e . down . parent = c 






else 






/* Case 4 */ 






move e from es to next 




for 


each edge e in next 






e . up = e . up . parent 






e . down = e . down . parent 






move e from next to es 





Figure 6 Algorithm for constructing the hierarchical decomposition tree. Algorithm for constructing the hierarchical decomposition tree. 
The listing shows the algorithm run for each level of the HDT construction. This algorithm is repeated until the es list is empty. 



Sand et al. BMC Bioinformatics 2013, 14(Suppl 2):S18 
http://www.biomedcentral.com/1471-2105/14/S2/S18 



Page 7 of 9 




We then verified the running time of our algorithm, see 
Figure 8a. As seen in the figure, the running time evens 
out when we divide with n log 2 n giving us confidence 
that the analyzed running time is correct. We also mea- 
sured where in the algorithm time was spent, whether it is 
in constructing the HDT, in coloring the leaves in the first 
tree, or in updating the counts in the HDT. Figure 8b 
illustrates the time spend on each of these three parts of 
the algorithm, normalized so the running time sums to 
one. For small trees, constructing the HDT makes up a 
sizable fraction of the running time, not surprising since 
the overhead in constructing components is larger than 
updating them. As the size of the trees increase, more 
time is spent on updating the HDT, as expected since 
updating the HDT runs in O (« log 2 n) while the other 
operations are asymptotically O (n log «). 



When changing the color of leaves, we spent time 
O (log n) updating the book-keeping in the HDT for 
each leaf. We only count after a complete sub-tree has 
changed color, however, so instead of updating the HDT 
for each color-change we could just mark which leaves 
have changed color and then update the HDT bottom- 
up, so each inner node would only be updated once 
when it is on a path from a changed leaf. We implemen- 
ted this, but found that the extra book-keeping from 
marking nodes and then updating increased the running 
time by 10%-15% compared to just updating the HDT. 

To render the use of the algorithm in practice, we 
implemented an Efficient 0(n 2 ) time algorithm based on 
the quartet distance algorithm presented in [12]. Figure 9 
shows the ratio of the running time for the 0(« 2 ) time 
algorithm against the 0(n log 2 n) time algorithm. It is 




(a) Total running time divided by n log 2 n. 




(b) The percent of time used on the three parts of the 
algorithm. 



Figure 8 Validation of running time, (a) Total running time divided by n log 2 n.. (b) The percent of time used on the three parts of the 
algorithm. The left figure shows the total running time divided by n log 2 n showing that the theoretical running time is achieved in the 
implementation. The right figure shows the percent of time used on the three parts of the algorithm: constructing the HDT of the second tree, 
coloring the leaves in the first tree, and updating the HDT accordingly. Constructing the HDT takes a considerable part of the time, but as the 
trees grow, updating the HDT takes a larger part. The plots show the average over 50 experiments for each size n. 
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Comparison to running time of 0(n 2 ) algorithm 



t O{n 1 )l t O{n log 2 n) 




n = number of leaves 

Figure 9 Comparison to 0(n 2 ) algorithm. Ratio between the running times for the 0(n 2 ) time algorithm and the 0(n log 2 n) time algorithm. 



evident that our algorithm is fastest for all practical pur- 
poses. The speed-up factor for n = 2900 is 46. 

Conclusions 

We have presented an O (n log 2 n) time algorithm for 
computing the triplet distance between two binary trees 
and experimentally validated its correctness and time 
analysis. 

The algorithm builds upon the ideas in the O (n log 2 n) 
time algorithm for computing the quartet distance 
between binary trees [13], but where the book-keeping in 
the quartet distance algorithm is rather involved, making 
it inefficient in practice, the book-keeping in the triplet 
distance algorithm in this paper is entirely different, and 
significantly simpler and faster. 

Compressing the HDT during the algorithm makes it 
possible to reduce the running time of the quartet distance 
algorithm to O (n log n) and the same approach can also 
reduce the running time of the triplet algorithm to O (« 
log n). We have left for future work to experimentally test 
whether this method incurs too much overhead to make it 
practically worthwhile. 

Unlike the O (w 2 ) time algorithm of Bansal et al. [9] our 
algorithm does not generalize to non-binary trees. It is 
possible to extend the algorithm to non-binary trees by 
employing more colors, as was done for the quartet 



distance [6], but this makes the algorithm depend on the 
degree of nodes, and future work is needed to develop a 
sub-quadratic algorithm for general rooted trees. 
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